Project Overview & Objective:
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
Objective
Data Dictionary
Scoring Rubric:
# Importing the Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
!pip install nb-black
# Loading the Dataset
pd.set_option('display.max_columns', None)
from google.colab import files
data_to_load = files.upload()
import io
df = pd.read_csv(io.BytesIO(data_to_load['Loan_Modelling.csv']))
# Looking at the Shape of the Dataset
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns.')
# Taking an initial look
df.info()
# Looking at the types in the Dataset
df.dtypes
We have 14 columns in the dataset and 5,000 rows. 13 columns have datatype int and 1 column (CCAvg) has datatype float.
# Confirming df has no null
df.isnull().values.any()
# Since missing values can also be encoded as 0's in the dataset, I'm checking for abnormal amounts of 0's in each of the columns
for column_name in df.columns:
    column = df[column_name]
    # Get the count of zeros in the column
    count = (column == 0).sum()
    print('Count of zeros in column', column_name, 'is:', count)
# Taking a look at the first 10 rows
df.head(10)
# Taking a look at the last 10 rows
df.tail(10)
The 0's present in the dataset don't appear abnormal: they make sense in the context of the columns they occur in (Experience, CCAvg, Mortgage, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard), where a 0 simply means "no" or "not applicable".
# Checking for duplicates in df
df.duplicated().sum()
There are no duplicate entries in df
# Checking important information of the dataframe columns
df.describe(include='all').T
# Checking for the number of unique values in each of the columns / what the values are
for col in df.columns:
    print(df[col].value_counts())
# Installing Pandas Profiling
!pip install -U pandas_profiling
# Importing ProfileReport
from pandas_profiling import ProfileReport
# Generating a Pandas ProfileReport to gain some more initial insights
df.profile_report()
Looking more closely at (and visualizing) each of the individual columns, except for customer ID since we know it is unique.
Age Column
sns.histplot(x=df["Age"], kde = True)
sns.boxplot(x=df["Age"], showfliers=True, fliersize=5)
The values for Age look reasonable and there don't appear to be any outliers.
Experience Column
sns.histplot(x=df["Experience"], kde = True)
sns.boxplot(x=df["Experience"], showfliers=True, fliersize=5)
The values for Experience also look reasonable, but there are some negative values which don't make sense (as seen earlier), so I will treat those.
df.describe()
We see a confirmation of this since the minimum value for Experience is -3
# Setting negative Experiences values to zero
df.loc[df.Experience < 0, 'Experience'] = 0
# Checking our changes
df.describe()
# Confirming our changes by visualizing the updated Histogram
sns.histplot(x=df["Experience"], kde = True)
We see the negative values have been fixed and set to zero.
Income Column
sns.histplot(x=df["Income"], kde = True)
sns.boxplot(x=df["Income"], showfliers=True, fliersize=5)
The values for Income seem reasonable but the distribution is right-skewed due to the presence of some higher income values. Some outliers exist on the right-side but the presence of higher incomes makes sense and seems important to incorporate in the models so we will not remove these.
ZIPCode Column
sns.histplot(x=df["ZIPCode"], kde = True)
sns.boxplot(x=df["ZIPCode"], showfliers=True, fliersize=5)
The values in the ZIPCode column are reasonable. I was having trouble with the USZipCode library, so I decided to leave the values as they are and not convert them to a categorical variable either, since there are so many unique ZIPCode values.
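For reference, one lightweight alternative (not applied here) would be to bin ZIPCode on its leading two digits, which roughly correspond to broader geographic areas, yielding a low-cardinality categorical. A sketch on sample values:

```python
import pandas as pd

# Sample ZIP codes for illustration only
zips = pd.Series([90089, 92697, 94720, 91320])
region = (zips // 1000).astype(str)  # keep the leading two digits
print(region.tolist())  # ['90', '92', '94', '91']
```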
Family Column
sns.histplot(x=df["Family"])
sns.countplot(data=df, x='Family');
plt.xticks(rotation=90)
There are 4 distinct values for family-size (1, 2, 3, 4) which makes sense. As we saw earlier, 1 is the most frequent, 4 is the next most frequent, then 2, and finally 3.
CreditCard Average Column
sns.histplot(x=df["CCAvg"], kde = True)
sns.boxplot(x=df["CCAvg"], showfliers=True, fliersize=5)
The values for average credit card expenses per month make sense. The distribution is right-skewed due to the presence of customers who have significantly higher spending habits and some of these are shown as outliers in the box-plots, but these points make sense and seem important to include in the model so I will not remove them.
Education Column
sns.histplot(x=df["Education"])
sns.countplot(data=df, x='Education');
plt.xticks(rotation=90)
There are 3 distinct values for Education (1: Undergrad, 2: Graduate, 3: Advanced/Professional). As we saw earlier, Undergrad (1) is the most frequent, Advanced/Professional (3) is next, and finally Graduate (2).
Mortgage Column
sns.histplot(x=df["Mortgage"], kde = True)
sns.boxplot(x=df["Mortgage"], showfliers=True, fliersize=5)
The distribution for Mortgage is right-skewed due to the presence of many 0 values. We saw earlier that this makes sense because that indicates some customers do not have a house mortgage. We also see the skewness is due to the presence of customers with very high values for their house mortgages. These values are shown as outliers in the box-plot but the values make sense and seem important/significant to include in the model so I will not remove them.
Personal Loan Column (Target Variable)
sns.histplot(x=df["Personal_Loan"])
sns.countplot(data=df, x='Personal_Loan');
plt.xticks(rotation=90)
There are two distinct values for Personal Loan: 0 (the customer did not accept the personal loan offered in the last campaign) and 1 (the customer did accept the personal loan offered in the last campaign). There are more 0 values indicating more customers did not accept the personal loan. This will be our target variable in the models.
Securities Account Column
sns.histplot(x=df["Securities_Account"])
sns.countplot(data=df, x='Securities_Account');
plt.xticks(rotation=90)
There are two distinct values for Securities Account: 0 (the customer does not have a securities account with the bank) and 1 (the customer does have a securities account with the bank). There are more 0 values which indicates that more customers do not have securities accounts with the bank and these values make sense.
CD Account Column
sns.histplot(x=df["CD_Account"])
sns.countplot(data=df, x='CD_Account');
plt.xticks(rotation=90)
CD_Account has two distinct values: 0 (the customer does not have a certificate of deposit (CD) account with the bank) and 1 (the customer does). There are more 0 values, indicating that most customers do not have a CD account with the bank. One of the business goals is to convert liability customers to personal loan customers while retaining them as depositors, so we'll need to keep that in mind.
Online Column
sns.histplot(x=df["Online"])
sns.countplot(data=df, x='Online');
plt.xticks(rotation=90)
There are two distinct values for Online: 0 (the customer does not use internet banking facilities) and 1 (the customer does use internet banking facilities). There are more 1 values, indicating more customers use internet banking facilities than those who don't. This makes sense and will be useful information when giving insights to the business.
Credit Card Column
sns.histplot(x=df["CreditCard"])
sns.countplot(data=df, x='CreditCard');
plt.xticks(rotation=90)
There are two distinct values for CreditCard: 0 (the customer does not use a credit card issued by any other Bank) and 1 (the customer does use a credit card issued by another bank). There are more 0 values than 1's, indicating that more customers do not use a credit card issued by another bank than the number of customers that do.
Next:
# Creating a PairPlot
sns.pairplot(df,diag_kind='kde')
# Creating a heatmat to observe correlations between the columns
plt.figure(figsize = (16,8))
sns.heatmap(data=df.corr(), annot=True);
The heatmap confirms what we saw in the Pandas ProfileReport: Age is strongly correlated with Experience, and Income with CCAvg. Let's look more closely at each of these relationships through visualizations.
sns.lineplot(data = df , x = 'Age' , y = 'Experience');
plt.xticks(rotation=90);
sns.scatterplot(data=df, x='Age', y='Experience',hue="Personal_Loan");
plt.xticks(rotation=90);
plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
Age and Experience appear to have a mostly strong, direct linear relationship
sns.lineplot(data = df , x = 'Income' , y = 'CCAvg');
plt.xticks(rotation=90);
sns.scatterplot(data=df, x='Income', y='CCAvg',hue="Personal_Loan");
plt.xticks(rotation=90);
plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
As Income increases, CCAvg seems to increase but from the scatterplot we see that is not always the case. We also see that people with higher incomes seem to be more likely to have accepted a personal loan which makes sense.
sns.lineplot(data = df , x = 'Income' , y = 'Mortgage');
plt.xticks(rotation=90);
sns.scatterplot(data=df, x='Income', y='Mortgage',hue="Personal_Loan");
plt.xticks(rotation=90);
plt.legend(bbox_to_anchor=(1.5, 1), borderaxespad=0)
As Income increases, Mortgage values seem to increase too, which makes sense. But from the scatterplot we also see higher income values with lower corresponding Mortgage values; this makes sense because people with higher incomes are able to pay off their mortgage and bring the value down to (or close to) zero. We see the highest income values have Mortgage values of 0, which confirms this.
sns.scatterplot(data=df, x='Personal_Loan', y='Income');
plt.xticks(rotation=90);
People within the middle range of Income appear to be more likely to accept a personal loan than people on the upper or lower bounds of the Income range. This makes sense because people with a higher income have no need for a loan, whereas people in the lower range probably do not want to take a large financial risk.
sns.scatterplot(data=df, x='Personal_Loan', y='CCAvg');
plt.xticks(rotation=90);
People with higher CCAvg values are more likely to accept a personal loan than people with lower monthly credit card averages. People with low-to-middle CCAvg values also seem to accept personal loans.
# Preparing the Split
X = df.drop('Personal_Loan',axis=1)
Y = df['Personal_Loan']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# Checking the Split
print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))
# Checking the values of Personal Loan in all 3 datasets
print("Original Personal_Loan True Values : {0} ({1:0.2f}%)".format(len(df.loc[df['Personal_Loan'] == 1]), (len(df.loc[df['Personal_Loan'] == 1])/len(df.index)) * 100))
print("Original Personal_Loan False Values : {0} ({1:0.2f}%)".format(len(df.loc[df['Personal_Loan'] == 0]), (len(df.loc[df['Personal_Loan'] == 0])/len(df.index)) * 100))
print("")
print("Training Personal_Loan True Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Personal_Loan False Values : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Personal_Loan True Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Personal_Loan False Values : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
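As a side note, the class proportions above can drift slightly between the training and test sets; passing `stratify` to `train_test_split` preserves the original Personal_Loan ratio exactly. A minimal sketch on toy data (the notebook's own split is left unchanged):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (10% positives) standing in for Personal_Loan
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
# Stratification keeps the positive rate at 10% in both splits
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```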
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)
# Function to check model performance using metrics
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # predicting the probability of class 1 and applying the threshold
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = (pred_prob > threshold).astype(int)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
# Function to plot the Confusion Matrix with threshold as an argument
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion matrix, based on the threshold specified, with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = (pred_prob > threshold).astype(int)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Creating the model
model = LogisticRegression(solver="newton-cg", random_state=1)
logistic = model.fit(x_train, y_train)
Checking performance on the training set
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(logistic, x_train, y_train)
log_reg_model_train = model_performance_classification_sklearn_with_threshold(
logistic, x_train, y_train
)
print("Training performance:")
log_reg_model_train
Checking performance on the test set
# Creating confusion matrix
confusion_matrix_sklearn_with_threshold(logistic, x_test, y_test)
log_reg_model_test = model_performance_classification_sklearn_with_threshold(
logistic, x_test, y_test
)
print("Test performance:")
log_reg_model_test
We see a very poor recall score on both the training and test sets, which can be improved.
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)
# Training Set
logit_roc_auc_train = roc_auc_score(y_train, logistic.predict_proba(x_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, logistic.predict_proba(x_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Test Set
logit_roc_auc_test = roc_auc_score(y_test, logistic.predict_proba(x_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, logistic.predict_proba(x_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Setting the optimal threshold according to the AUC-ROC curve
# (maximizing tpr - fpr, i.e., Youden's J statistic: high tpr, low fpr)
fpr, tpr, thresholds = roc_curve(y_train, logistic.predict_proba(x_train)[:, 1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
# Plotting confusion matrix of the model with the threshold modified for the training set
confusion_matrix_sklearn_with_threshold(
logistic, x_train, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for training set
log_reg_model_train_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
logistic, x_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_threshold_auc_roc
# Plotting confusion matrix of the model with the threshold modified for the test set
confusion_matrix_sklearn_with_threshold(
logistic, x_test, y_test, threshold=optimal_threshold_auc_roc
)
# checking model performance for test set
log_reg_model_test_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
logistic, x_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_threshold_auc_roc
We see that the recall score has improved significantly by setting the optimal threshold according to the AUC-ROC curve. Let's see if we can improve it further by using the precision-recall curve to find a better threshold value.
y_scores = logistic.predict_proba(x_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
We see that around 0.25 we get a higher recall score and good precision score so we will use this as our threshold. We would get equal precision and recall scores with a threshold of around 0.32.
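The crossover point eyeballed above can also be located programmatically. A sketch that finds the threshold where precision and recall are closest, on synthetic scores since this is illustrative only (the notebook itself uses `y_train` and `y_scores`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustration only: synthetic labels/scores stand in for y_train and y_scores
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 200)
scores = np.clip(y_true * 0.4 + rng.normal(0.3, 0.2, 200), 0, 1)

prec, rec, thr = precision_recall_curve(y_true, scores)
# prec/rec have one more entry than thr; drop the last point, then find the
# threshold where the two curves are closest (the crossover point)
idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
print(f"precision ~= recall at threshold {thr[idx]:.3f}")
```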
# Setting the threshold value
optimal_threshold_curve = 0.25
# Checking the scores of the new model for the training set
log_reg_model_train_threshold_curve = model_performance_classification_sklearn_with_threshold(
logistic, x_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_threshold_curve
# Plotting a confusion matrix for the training set
confusion_matrix_sklearn_with_threshold(
logistic, x_train, y_train, threshold=optimal_threshold_curve
)
# Checking the scores of the new model for the test set
log_reg_model_test_threshold_curve = model_performance_classification_sklearn_with_threshold(
    logistic, x_test, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_threshold_curve
# Plotting a confusion matrix for the test set
confusion_matrix_sklearn_with_threshold(
logistic, x_test, y_test, threshold=optimal_threshold_curve
)
We see that although our accuracy score has improved, our recall score has gotten worse (which is what we are focused on in this situation). Therefore, we can conclude that the previous model (Model 2) has the best performance of the Logistic Regression models but let's compare them to be sure.
# Training performance comparison
models_train_comp_df = pd.concat(
    [
        log_reg_model_train.T,
        log_reg_model_train_threshold_auc_roc.T,
        log_reg_model_train_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Initial Model",
    "AUC-ROC Curve",
    "0.25 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
# Testing performance comparison
models_test_comp_df = pd.concat(
    [
        log_reg_model_test.T,
        log_reg_model_test_threshold_auc_roc.T,
        log_reg_model_test_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Initial Model",
    "AUC-ROC Curve",
    "0.25 Threshold",
]
print("Testing performance comparison:")
models_test_comp_df
We see a confirmation that we get the best recall score on both our training and test sets with Model 2, where we set the optimal threshold according to the AUC-ROC curve (training recall: 0.915408, test recall: 0.892617).
# Finding the coefficients and calculating the odds
log_odds = logistic.coef_[0]
pd.DataFrame(log_odds, x_train.columns, columns=["coef"]).T
# Converting coefficients to odds
odds = np.exp(logistic.coef_[0])
# Finding the percentage change and adding to a dataframe
perc_change_odds = (np.exp(logistic.coef_[0]) - 1) * 100
pd.set_option("display.max_columns", None)
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=x_train.columns).T
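To make the Change_odd% interpretation concrete, here is a worked one-liner with a hypothetical coefficient of 0.05 (illustrative only, not a fitted value from this model):

```python
import numpy as np

coef = 0.05  # hypothetical logistic regression coefficient (illustration)
odds_multiplier = np.exp(coef)            # multiplicative change in odds per unit
pct_change = (odds_multiplier - 1) * 100  # the Change_odd% quantity
print(f"odds multiply by {odds_multiplier:.4f} per unit increase (+{pct_change:.2f}%)")
```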
# Function to calculate different metrics and check model performance
def model_performance_classification_sklearn(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# Function to plot the Confusion Matrix with percentages
def confusion_matrix_sklearn(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Building the model and fitting it to the training set
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(x_train, y_train)
# Checking the Accuracy of the Models fit to the training and test sets
print("Accuracy on training set : ", model.score(x_train, y_train))
print("Accuracy on test set : ", model.score(x_test, y_test))
#Checking number of positives in Personal Loan
Y.sum(axis = 0)
Out of 5,000 values, 480 are positive. A naive model that simply marked every value as negative would achieve an accuracy of 90.4%, so accuracy alone is not a reliable metric on this imbalanced dataset; we also need to look at recall and other metrics.
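The majority-class baseline described above can also be computed directly with sklearn's `DummyClassifier`; a sketch on a toy target with the same 480/5000 positive rate:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([1] * 480 + [0] * 4520)  # same positive rate as the dataset
X = np.zeros((5000, 1))               # features are ignored by this strategy

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.904 -- but recall for class 1 would be 0
```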
# Checking Recall, Precision, and F1 score of the training set
decision_tree_perf_train = model_performance_classification_sklearn(
model, x_train, y_train
)
decision_tree_perf_train
# Checking Recall, Precision, and F1 score of the test set
decision_tree_perf_test = model_performance_classification_sklearn(
model, x_test, y_test
)
decision_tree_perf_test
We see our model is slightly overfit to the training data so we'll need to fix this later.
# Confusion Matrix of Training Set
confusion_matrix_sklearn(model, x_train, y_train)
# Confusion Matrix of Test Set
confusion_matrix_sklearn(model, x_test, y_test)
Visualizing the Decision Tree
feature_names = list(X.columns)
print(feature_names)
plt.figure(figsize=(20,30))
tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Checking the Gini importance of each feature
print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = x_train.columns).sort_values(by = 'Imp', ascending = False))
# Visualizing these values
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='pink', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Making a new Decision Tree Model with a max depth of 4
model2 = DecisionTreeClassifier(criterion = 'gini',max_depth=4,random_state=1)
model2.fit(x_train, y_train)
# Checking the Accuracy of the New Models fit to the training and test sets
print("Accuracy on training set : ", model2.score(x_train, y_train))
print("Accuracy on test set : ", model2.score(x_test, y_test))
# Checking Recall, Precision, and F1 score of the training set
decision_tree2_perf_train = model_performance_classification_sklearn(
model2, x_train, y_train
)
decision_tree2_perf_train
# Confusion Matrix of Training Set
confusion_matrix_sklearn(model2, x_train, y_train)
# Checking Recall, Precision, and F1 score of the test set
decision_tree2_perf_test = model_performance_classification_sklearn(
model2, x_test, y_test
)
decision_tree2_perf_test
# Confusion Matrix of Test Set
confusion_matrix_sklearn(model2, x_test, y_test)
# Visualizing the Tree
plt.figure(figsize=(15,10))
tree.plot_tree(model2,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
We see that over-fitting has decreased with the less complex model (the test and train values are closer together), and our recall score has improved slightly. Let's see if hyperparameter tuning using GridSearchCV helps more.
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(1,10),
'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(x_train, y_train)
# Accuracy of the new model
print("Accuracy on training set : ",estimator.score(x_train, y_train))
print("Accuracy on test set : ",estimator.score(x_test, y_test))
# Checking Recall, Precision, and F1 score of the Training set
decision_tree3_perf_train = model_performance_classification_sklearn(
estimator, x_train, y_train
)
decision_tree3_perf_train
# Confusion Matrix of Training Set
confusion_matrix_sklearn(estimator, x_train, y_train)
# Checking Recall, Precision, and F1 score of the test set
decision_tree3_perf_test = model_performance_classification_sklearn(
estimator, x_test, y_test
)
decision_tree3_perf_test
# Confusion Matrix of Test Set
confusion_matrix_sklearn(estimator, x_test, y_test)
# Visualizing the Tree
plt.figure(figsize=(15,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# Visualizing Effective Alpha vs Total Leaf Impurity
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
As expected, the total impurity of the leaves increases as effective alpha increases.
# Training the Decision Tree using Effective Alphas
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
# Computing the recall score on the training set for each pruned tree
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(x_train)
    recall_train.append(metrics.recall_score(y_train, pred_train3))
# Computing the recall score on the test set for each pruned tree
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(x_test)
    recall_test.append(metrics.recall_score(y_test, pred_test3))
# Plotting Recall vs Alpha for the new model
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Creating the model with the highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
# Checking training set performance for the new model
decision_tree4_perf_train = model_performance_classification_sklearn(
best_model, x_train, y_train
)
decision_tree4_perf_train
# Plotting training set confusion matrix for the new model
confusion_matrix_sklearn(best_model, x_train, y_train)
# Checking testing set performance for the new model
decision_tree4_perf_test = model_performance_classification_sklearn(
best_model, x_test, y_test
)
decision_tree4_perf_test
# Plotting testing set confusion matrix for the new model
confusion_matrix_sklearn(best_model, x_test, y_test)
# Visualizing the Tree
plt.figure(figsize=(20,30))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
We see that while the recall score has improved from Models 2 and 3, this model is extremely complex and does overfit.
# Comparing the training set performance of all of the Decision Tree models
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree2_perf_train.T,
        decision_tree3_perf_train.T,
        decision_tree4_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree 1 sklearn",
    "Decision Tree (Max Depth: 4)",
    "Decision Tree (Hypertuned)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
# Comparing the test set performance of all of the Decision Tree models
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree2_perf_test.T,
        decision_tree3_perf_test.T,
        decision_tree4_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree 1 sklearn",
    "Decision Tree (Max Depth: 4)",
    "Decision Tree (Hypertuned)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
We get the best training recall with Model 2 (max depth of 4), but the best test recall comes from our 4th model, created with post-pruning. The 4th model's test recall is even higher than its training recall (recall being the key metric in this situation), and its accuracy remains consistently high, so we can conclude it is the best model for this situation despite its complexity.
# Function to create stacked bar-plots
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
stacked_barplot(df, "Education", "Personal_Loan")
sns.histplot(data=df, x="Income", hue="Personal_Loan")
stacked_barplot(df, "Family", "Personal_Loan")
sns.histplot(data=df, x="CCAvg", hue="Personal_Loan")
stacked_barplot(df, "CD_Account", "Personal_Loan")
sns.histplot(data=df, x="Experience", hue="Personal_Loan")
sns.histplot(data=df, x="ZIPCode", hue="Personal_Loan")
sns.histplot(data=df, x="Age", hue="Personal_Loan")
sns.histplot(data=df, x="Mortgage", hue="Personal_Loan")
When creating a marketing campaign, it is important to know who the target demographic is. Based on the findings from the EDA and models, here are some key takeaways/insights for the marketing team to inform their campaign decisions: